Overview

This is a project to look at transcription in the growth cone vs. the soma, initiated by Alexandros Poulopoulos from the Macklis lab. The experimental setup is a little complex, there were only a small number of actual biological replicates run but several technical replicates. The technical replicates come in two flavors. The first is that the samples were run multiple times, there was some overclustering problems on the lanes that lead to a low number of reads, so the libraries were rerun. The second is that a small set of the samples only had half of the sample loaded, the idea was to look at the quantitative range.

There are paired growthcone-soma samples that are paired via sort date, and there are three sort dates, so there are three biological replicates. We will block on sort date in the model.

We’ll simply add the counts together for the reruns of the samples if they look to be highly correlated and don’t seem to have a batch effect. For the half loaded samples, we only have half amplified samples for the growthcone, and only from two two dates, Feb15 and Jun15. I don’t really know how we can use this information other than to just compare it to the non-half loaded samples and see how well the correlation is. We’ll probably drop those when we do the analysis.

Quality control metrics

Mapped reads

We can see here that there is quite a bit of variation between the mapped reads that persists across the 3 different runs. BC9 in particular has a very low number of reads mapped.

Genomic mapping rate

Mapping rate is a better metric of if there is something wrong with the libraries. Here we see that BC14 and BC9 look pretty bad for their mapping rate but overall the mapping rate is lower than we would expect, we’d normally be expecting about 90% of the reads to map. This could be due to some kind of adapter contamination issues we are not addressing when making the samples. What kit was used to make these libraries? What adapter sequences can we expect to see on the end of the sequences?

We can see from the above plot that in general the growthcone libraries have a poorer mapping rate than the non-growthcone libraries.

Number of genes detected

We detect on average less genes in the growthcone libraries, this might make sense if there are only a subset of the genes in the growthcones. This would be a more convincing argument if there was not one growthcone sample BC10 that has a high number of genes deteceted.

Gene detection saturation

This plot is to help us figure out if we are saturating the number of genes robustly detected in sequencing. Here we called detected as having counts > 10, rather than counts > 0. It looks like both the growthcone and the soma libraries cap out at the same rate, which is not what we would be expecting if the growthcone libraries were a smaller set of RNA than the soma libraries.

Exonic mapping rate

Another way to determine what went wrong with the sequencing is to look at the rate mapped reads map to exons. Here we see that for BC9 of the reads that map, a small amount map to exons. This, coupled with the low mapping rate generally indicates there was very little RNA in this sample and most of what got sequenced was either adapter sequence or genomic DNA.

rRNA mapping rate

rRNA mapping rate is reasonable for all the samples. Again we see BC9 with a very small rate, since it seems like whatever is mapping in there is not from RNA.

Estimated fragment length of paired-end reads

We can see more problems with BC9– the fragment size is very large, that means we likely had large pieces of DNA leftover that the libraries were made from. We can also see some problems with BC14, the estimated fragment length is negative. This means that the average fragment length is less than the read length, which is an indication of degraded RNA. That might be why this sample looks a little worse than the others.

When we look at BC9, about 40% of the sequence is Truseq adapter sequence. We trim this off during our processing, so these usually indicate stacks of Truseq sequence when mulitple copies are stacked together. The BC14 samples, about 10% is is taken up by the Truseq adapater. Good libraries like BC10 don’t have the sequence.

All of these lines of evidence point towards BC9 not having RNA in the sample and BC14 having degraded RNA.

Boxplot of log10 TMM-normalized counts per gene

Trimmed mean of M-values (TMM) normalization is described here

Robinson, M. D., & Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3). doi:10.1186/gb-2010-11-3-r25

Normalizing the counts by TMM-normalization can’t full normalize the samples.

PCA plot

Here we can see that the BC9 and BC14 libraries cluster together. These are likely because these libraries both have issues with adapter contamination that are symtpoms of having a low RNA content. Dropping these samples seems prudent. The BC6 sample also looks like it is an outlier from the other samples in the PCA plot, but it is different than the BC9 and BC14 samples. The BC6 sample was the other sample with a very low number of genes detected. This sample doesn’t have the same adapter contamination issue, but the GC content looks out of whack with a huge spike in the mean GC content instead of an even distribution.

The BC6 library looks like it is less complex than the other libraries, when we look at the sequence duplication plots. As an extreme example, lets look a the duplication plot for BC9:

BC9 duplication

BC9 was mostly adapter sequence and we can see there are reads that are duplicated over 10k times. For an example of what a good library looks like, lets look at BC10:

BC10 duplication

It is normal to see some duplication in RNA-seq just because there is only 2% of the genome that is sequenced, so you should see some duplication, just not sequences with a crazy amount.

BC6 has a big spike:

BC6 duplication

A sample that was like BC6 in terms of reads mapped and mapping percentage is BC5. That sample doesn’t have as much of a prominent spike:

BC5 duplication

These are all growthcone libraries. So the question is, is this sample what we are expecting to see when we look at the growthcones, a low number of genes detected? Or are the other samples where we see a high number of genes detected what we expect. I think that the BC6 library probably also had low RNA content, just not as low as BC9 and BC14, but it is still problematic. So going forward I’m going to drop BC9, BC14 and BC6 from the analysis. Hopefully that leaves us with enough data to do something with.

The BC6 and BC5 samples are the two samples that are in the half loaded group, so I think the takehome is don’t do that in the future.

PCA plot with filtered samples

There is some variability by sort date. The Feb15 samples separate out from the other samples along the 2nd principal component by sort date:

Unfortunately the growthcone sample from Feb15, the only sample that survived is BC5, which is the last remaining half loaded sample, so we might be making some apples to oranges type comparison with that one.

We can see the amount of starting material has an effect for the Feb15 soma samples but not so much for the Oct14 samples. We can just drop the samples with starting material = 50 for those.

Final touches

To recap we’re left with samples that separate like this on the PCA:

Each of those are groups of three from each run; the runs almost perfectly overlap so it is safe to combine them into one set of counts for each library. We’ll construct sample summary information and a count matrix of those combined samples. For the counts we’ll just add the counts together for each run.

That leaves us with this for the PCA:

and samples that look like this:

barcode barcode_id compartment half_amplified sort_date Name
BC10 GGAGAA BC10 growthcone no Oct14 BC10
BC12 GAGTCA BC12 soma no Oct14 BC12
BC16 TTGGCA BC16 soma no Feb15 BC16
BC5 ACCTCA BC5 growthcone yes Feb15 BC5
BC7 AAGCCT BC7 growthcone no Jun15 BC7
BC8 GTCGTA BC8 soma no Jun15 BC8

We’re a little stuck because one of the Feb15 samples is the half amplified sample so we’re comparing apples to oranges a little bit there.

Differential expression

For diffential expression we’ll drop all genes that have aren’t expressed in any of the samples, and analyze them with DESeq2, running a paired analysis looking at the effect of compartment, doing a paired analysis comparing only samples sorted on the same date to each other.

Effect of variance stabilization

Dispersion estimates

MA-plots

Volcano-plots

TableGrob (2 x 2) "arrange": 3 grobs
  z     cells    name                 grob
1 1 (2-2,1-1) arrange      gtable[arrange]
2 2 (2-2,2-2) arrange      gtable[arrange]
3 3 (1-1,1-2) arrange text[GRID.text.9097]

DEGreport

Pvalues-vs-Mean

Here we plot some information about how the p-values are correlated with the mean or the standard deviation. For these, within each quantile we should see an even distribution. We should see an encrichment for lower p-values if there is a set of genes differentially expressed and we see that as well.

Pvalues-vs-Variation

Mean-vs-Variation

Differentially expressed genes

Finally, we’ll take the sets of differentially expressed genes and do some work with them. We’ll attach some metadata about the genes to each gene and we’ll do some gene set enrichment analyses to see if certain sets of genes are differentially regulated.

There are NA differentially expressed genes. NA are higher in the growthcones compared to the soma and NA are lower in the growthcones compared to the soma.

Many of these differentially expressed genes we see only in the growthcone:

baseMean log2FoldChange lfcSE stat pvalue padj symbol
ENSMUSG00000019230 180.53565 9.878780 1.451184 6.807394 0.0000000 0.0000001 Lhx9
ENSMUSG00000061808 160.98345 9.707657 1.461727 6.641226 0.0000000 0.0000002 Ttr
ENSMUSG00000029306 187.62623 9.011205 1.439243 6.261072 0.0000000 0.0000016 Ibsp
ENSMUSG00000025927 125.02822 9.085669 1.512571 6.006771 0.0000000 0.0000046 Tfap2b
ENSMUSG00000030787 157.28547 8.726159 1.454612 5.998959 0.0000000 0.0000046 Lyve1
ENSMUSG00000097248 114.30502 8.145578 1.498458 5.435972 0.0000001 0.0000957 Gm2694
ENSMUSG00000074141 114.08149 7.913042 1.567218 5.049101 0.0000004 0.0003898 Il4i1
ENSMUSG00000027555 68.16616 7.664327 1.541014 4.973560 0.0000007 0.0005050 Car13
ENSMUSG00000099021 59.25460 7.469163 1.611454 4.635045 0.0000036 0.0016248 Rn7s1
ENSMUSG00000019929 80.11998 7.024879 1.613057 4.355011 0.0000133 0.0038787 Dcn
ENSMUSG00000028370 52.23069 6.948483 1.632888 4.255333 0.0000209 0.0050314 Pappa
ENSMUSG00000038805 38.32846 6.985116 1.646536 4.242311 0.0000221 0.0051548 Six3
ENSMUSG00000013584 52.43797 6.840155 1.636036 4.180930 0.0000290 0.0062612 Aldh1a2
ENSMUSG00000097292 61.17950 6.799249 1.641647 4.141723 0.0000345 0.0067261 A230107N01Rik
ENSMUSG00000022053 54.34164 6.737800 1.648701 4.086733 0.0000437 0.0075281 Ebf2
ENSMUSG00000057058 55.39098 6.706249 1.651297 4.061201 0.0000488 0.0080021 Skap1
ENSMUSG00000029697 38.05380 6.699841 1.659799 4.036538 0.0000542 0.0085115 Fezf1
ENSMUSG00000046352 41.33532 6.687360 1.656769 4.036385 0.0000543 0.0085115 Gjb2
ENSMUSG00000028487 25.83526 6.757819 1.678552 4.025981 0.0000567 0.0087185 Bnc2
ENSMUSG00000061132 48.65750 6.572985 1.661828 3.955275 0.0000764 0.0110179 Blnk
ENSMUSG00000054044 43.26129 6.584997 1.669434 3.944448 0.0000800 0.0110477 NA
ENSMUSG00000018698 33.07315 6.539978 1.662046 3.934896 0.0000832 0.0113686 Lhx1
ENSMUSG00000086382 58.42728 6.472519 1.650775 3.920896 0.0000882 0.0117880 Chrna1os
ENSMUSG00000026726 29.08378 6.452672 1.672313 3.858531 0.0001141 0.0132673 Cubn
ENSMUSG00000033794 67.28821 6.332658 1.649456 3.839240 0.0001234 0.0139188 Lpcat2b
ENSMUSG00000059246 33.75503 6.442316 1.685811 3.821493 0.0001326 0.0145591 Foxb1
ENSMUSG00000060969 49.80389 6.324498 1.669439 3.788397 0.0001516 0.0161802 Irx1
ENSMUSG00000039109 28.28900 6.387730 1.701101 3.755056 0.0001733 0.0179026 F13a1
ENSMUSG00000066809 29.71418 6.306957 1.690107 3.731691 0.0001902 0.0188557 Gm10180
ENSMUSG00000034384 27.64094 6.292593 1.692159 3.718676 0.0002003 0.0195390 Barhl2
ENSMUSG00000087343 53.31006 6.173489 1.667612 3.701993 0.0002139 0.0198816 1700021N21Rik
ENSMUSG00000034362 39.97925 6.183420 1.703384 3.630080 0.0002833 0.0228647 Csta1
ENSMUSG00000097960 43.18918 6.136071 1.688677 3.633656 0.0002794 0.0228647 A330074K22Rik
ENSMUSG00000086526 38.77529 6.052672 1.710329 3.538894 0.0004018 0.0276130 Gm12762
ENSMUSG00000058952 27.11991 6.085238 1.725598 3.526451 0.0004212 0.0279861 Cfi
ENSMUSG00000065767 31.17731 5.994285 1.726391 3.472148 0.0005163 0.0317351 Gm23849
ENSMUSG00000053747 47.12911 5.845555 1.691006 3.456851 0.0005465 0.0332597 Sox14
ENSMUSG00000012819 27.31586 5.866267 1.742404 3.366766 0.0007606 0.0395985 Cdh23
ENSMUSG00000024810 20.78847 5.887744 1.759644 3.345986 0.0008199 0.0418220 Il33
ENSMUSG00000072789 42.95402 5.696333 1.704526 3.341887 0.0008321 0.0419225 Gm10420
ENSMUSG00000087575 22.66982 5.842407 1.755565 3.327936 0.0008749 0.0425490 Gm12976
ENSMUSG00000043286 27.03936 5.789444 1.748426 3.311231 0.0009289 0.0434164 Pnpla1
ENSMUSG00000031551 29.96437 5.737145 1.747520 3.283021 0.0010270 0.0460768 Ido1
ENSMUSG00000022949 12.59741 5.761572 1.787150 3.223888 0.0012646 0.0514771 Clic6
ENSMUSG00000078302 17.16787 5.615443 1.787999 3.140630 0.0016859 0.0602447 Foxd1
ENSMUSG00000030579 26.49493 5.489562 1.756698 3.124933 0.0017785 0.0627879 Tyrobp
ENSMUSG00000036907 22.88630 5.440797 1.762057 3.087754 0.0020168 0.0657423 C1ql2
ENSMUSG00000046714 17.21059 5.443551 1.784652 3.050203 0.0022869 0.0706344 Foxc2
ENSMUSG00000033249 12.32118 5.493753 1.802617 3.047655 0.0023063 0.0710575 Hsf4
ENSMUSG00000040592 18.08584 5.409538 1.798048 3.008562 0.0026249 0.0766703 Cd79b

and only in the soma:

baseMean log2FoldChange lfcSE stat pvalue padj symbol
ENSMUSG00000097330 53.34832 -7.736523 1.583414 -4.885976 0.0000010 0.0007071 Gm26672
ENSMUSG00000090066 53.50288 -7.686146 1.582836 -4.855932 0.0000012 0.0007752 1110002E22Rik
ENSMUSG00000072720 41.37656 -7.354547 1.614068 -4.556529 0.0000052 0.0020785 Myo18b
ENSMUSG00000078700 41.04595 -7.357684 1.615338 -4.554887 0.0000052 0.0020785 D030028A08Rik
ENSMUSG00000031015 35.55094 -7.031299 1.641255 -4.284098 0.0000183 0.0046031 Swap70
ENSMUSG00000024076 32.49559 -6.988414 1.647711 -4.241286 0.0000222 0.0051548 Vit
ENSMUSG00000076439 33.41303 -6.843497 1.650961 -4.145160 0.0000340 0.0067261 Mog
ENSMUSG00000022297 35.96862 -6.798469 1.664285 -4.084919 0.0000441 0.0075281 Fzd6
ENSMUSG00000003680 28.86612 -6.786555 1.667449 -4.070022 0.0000470 0.0079161 Taf6l
ENSMUSG00000043165 24.99438 -6.452523 1.694638 -3.807612 0.0001403 0.0151306 Lor
ENSMUSG00000032590 25.58625 -6.194192 1.674445 -3.699250 0.0002162 0.0198816 Apeh
ENSMUSG00000022562 23.06931 -6.297498 1.704904 -3.693755 0.0002210 0.0199988 Oplah
ENSMUSG00000052911 25.76963 -6.114397 1.676191 -3.647792 0.0002645 0.0221250 Lamb2
ENSMUSG00000023953 20.41719 -6.219618 1.719274 -3.617583 0.0002974 0.0232836 Polh
ENSMUSG00000036040 21.73913 -6.164578 1.720371 -3.583285 0.0003393 0.0251266 Adamtsl2
ENSMUSG00000074358 21.71465 -6.113547 1.709010 -3.577244 0.0003472 0.0254081 Ccdc61
ENSMUSG00000029918 19.35955 -6.115846 1.730016 -3.535139 0.0004076 0.0277308 Mrps33
ENSMUSG00000031907 17.83840 -5.981244 1.742541 -3.432484 0.0005981 0.0347170 Zfp90
ENSMUSG00000074576 16.93915 -5.930997 1.749967 -3.389205 0.0007010 0.0383108 Mocs3
ENSMUSG00000027932 17.13699 -5.910825 1.749346 -3.378877 0.0007278 0.0389007 Slc27a3
ENSMUSG00000001930 36.75336 -5.756988 1.710475 -3.365724 0.0007634 0.0395985 Vwf
ENSMUSG00000024942 18.56127 -5.873209 1.744486 -3.366727 0.0007607 0.0395985 Capn1
ENSMUSG00000096929 18.80221 -5.792251 1.751532 -3.306963 0.0009431 0.0437507 A330023F24Rik
ENSMUSG00000051235 17.12275 -5.741189 1.758212 -3.265356 0.0010933 0.0470265 Gen1
ENSMUSG00000031520 15.36256 -5.760815 1.767026 -3.260176 0.0011134 0.0473613 Vegfc
ENSMUSG00000034471 18.35723 -5.665601 1.743003 -3.250483 0.0011521 0.0483368 Caskin2
ENSMUSG00000043872 15.55832 -5.661474 1.770229 -3.198159 0.0013831 0.0534661 Zmym1
ENSMUSG00000030588 21.06924 -5.514464 1.742779 -3.164178 0.0015552 0.0574220 Yif1b
ENSMUSG00000022441 15.36754 -5.595616 1.775193 -3.152117 0.0016209 0.0587785 Efcab6
ENSMUSG00000022555 13.99173 -5.554287 1.784903 -3.111814 0.0018594 0.0633923 Dgat1
ENSMUSG00000031813 13.70398 -5.555615 1.787200 -3.108558 0.0018800 0.0633923 Mvb12a
ENSMUSG00000044229 16.03543 -5.519896 1.775394 -3.109111 0.0018765 0.0633923 Nxpe4
ENSMUSG00000097921 13.96588 -5.549020 1.785742 -3.107404 0.0018874 0.0633923 Gm26576
ENSMUSG00000074506 15.00148 -5.473133 1.783568 -3.068643 0.0021503 0.0679540 Gm10705
ENSMUSG00000052544 14.04804 -5.478582 1.788433 -3.063342 0.0021888 0.0686400 St6galnac3
ENSMUSG00000061740 13.53015 -5.460490 1.793256 -3.045014 0.0023267 0.0713268 Cyp2d22
ENSMUSG00000023805 13.03244 -5.457716 1.796544 -3.037896 0.0023824 0.0724910 Synj2
ENSMUSG00000039427 14.29382 -5.305392 1.781856 -2.977454 0.0029065 0.0809339 Alg1
ENSMUSG00000028583 12.70846 -5.343580 1.804119 -2.961878 0.0030577 0.0838398 Pdpn
ENSMUSG00000074715 12.04093 -5.313293 1.810639 -2.934486 0.0033410 0.0883503 Ccl28
ENSMUSG00000031337 12.49781 -5.281565 1.808758 -2.919995 0.0035004 0.0913588 Mtm1

Many of these are not genes that are just expressed at a low level. Many of them are in the upper quartile of the expression levels, so this isn’t just a case of a gene getting missed because it is expressed at a low level. Here are the quartiles of baseMean:

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
     0.2     11.1     67.2    343.0    245.8 916900.0 

I popped the gene lists into WebGestalt, separating them into up and down gene lists. There is a huge enrichment for proteins in the ribosomal machinery in the growthcones, there’s 86 genes in the gene list 42 of them are up in the growth cones. The focal adhesion KEGG pathway is also upregulated, they are several collagen subunits and laminin.

Down in growth cones are gene sets involved in aminoacyl-tRNA biosynthesis and RNA transport.

Those are just KEGG pathways, there are other types of gene set analysis you can do on WebGestalt and other types of tools. I included four lists of genes that you can use with these tools. The first list is of genes that are expressed, which I definied as having at least 10 counts total across all of the samples. The second list is a list of all of the differentially expressed genes, using a FDR cutoff of 0.1. If you don’t care about the direction of expression, you can load the DE list and then use the list of expressed genes as a background list for the enrichment tools. The two other lists are the DE list, but separated into up in growthcone or down in growthcone compared to the soma. You can use those lists similarly to look at genes specifically enriched or depleted from the growthcones. These are all Ensembl gene identifiers.

Those differentially expressed files and the expressed files are tarred up here

Summary

The results seem reasonable to me, being relatively enriched for RNA encoding ribosomal proteins makes a lot of sense for the growthcones. The data looks reasonable in terms of quality, with the caveats I talked about above. The growthcone data is widely variable, and future experiments might do well to do more replicates to have more power.